The following histogram shows the distribution of arrival delays, which is positively skewed (long right tail)
library(ggplot2)
ggplot(hflights, aes(ArrDelay)) + geom_histogram() + xlim(-50, 250)
The following histogram shows the distribution of departure delays, which is also positively skewed
ggplot(hflights, aes(DepDelay)) + geom_histogram() + xlim(-50, 250)
h=subset(hflights,select=c(ArrDelay,DepDelay))
h1=na.omit(h)
x= h1$ArrDelay
y= h1$DepDelay
length(x)
## [1] 223874
length(y)
## [1] 223874
quantile(x)
## 0% 25% 50% 75% 100%
## -70 -8 0 11 978
quantile(y)
## 0% 25% 50% 75% 100%
## -33 -3 0 9 981
Based on the quantiles above, we use x = 11 (the 3rd quartile of ArrDelay) and y = 0 (the 2nd quartile, i.e. the median, of DepDelay) as cut-offs
Next we calculate the cell counts for the x/y table
p1=nrow(subset(h1, ArrDelay <= 11 & DepDelay <= 0))
p1
## [1] 108141
p2=nrow(subset(h1, ArrDelay <= 11 & DepDelay > 0))
p2
## [1] 61026
p3=nrow(subset(h1, ArrDelay > 11 & DepDelay <= 0))
p3
## [1] 6159
p4=nrow(subset(h1, ArrDelay > 11 & DepDelay > 0))
p4
## [1] 48548
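The four subset() counts above can also be obtained in one step by cross-tabulating two indicator variables. This is a minimal sketch on a small invented data frame (`toy` is hypothetical and is not the hflights data); the cut-offs 11 and 0 are the ones chosen in the text.

```r
# Toy stand-in for h1; the values are invented for illustration only.
toy <- data.frame(
  ArrDelay = c(-5, 20, 3, 40, 11, 12),
  DepDelay = c(-2,  5, 1, -1,  0, 30)
)

# Cross-tabulate "above cut-off" indicators for the two delays.
counts <- table(
  ArrDelay = ifelse(toy$ArrDelay > 11, ">Q3", "<=Q3"),
  DepDelay = ifelse(toy$DepDelay > 0,  ">Q2", "<=Q2")
)

# addmargins() appends the row and column totals (labelled "Sum").
counts_with_totals <- addmargins(counts)
counts_with_totals
```

Indexing by name, for example `counts[">Q3", ">Q2"]`, avoids relying on the order in which the labels are sorted.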
Table of counts (rows: ArrDelay relative to its 3rd quartile; columns: DepDelay relative to its 2nd quartile)
table = matrix(c(p1,p3,p1+p3,p2,p4,p2+p4,p1+p2,p3+p4,p1+p2+p3+p4), nrow=3, ncol=3)
colnames(table) = c("<=Q2",">Q2","Total")
rownames(table) = c("<=Q3",">Q3","Total")
table
##         <=Q2    >Q2  Total
## <=Q3  108141  61026 169167
## >Q3     6159  48548  54707
## Total 114300 109574 223874
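Calculating a. P(X>x | Y>y) = P(X>x, Y>y) / P(Y>y), taking both terms from the table of counts (because of ties at 0, P(Y>y) is 109574/223874 rather than exactly 0.5)
a=(48548/223874)/(109574/223874)
a
## [1] 0.4430613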
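Calculating b. P(X>x, Y>y), the joint probability, read directly from the table of counts (the product of the marginals would equal it only under independence)
b=48548/223874
b
## [1] 0.2168541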
Calculating c. P(X<=x | Y<=y) = P(X<=x, Y<=y) / P(Y<=y), taking both terms from the table of counts
c =(108141/223874)/(114300/223874)
c
## [1] 0.9461155
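Does P(A, B) = P(A)P(B)? Here A is the event X>x and B is the event Y>y; the equality holds only if A and B are independent.
A = 54707/223874
B = 109574/223874
Calculate P(A, B) from the table of counts
48548/223874
## [1] 0.2168541
Calculate P(A)*P(B)
A*B
## [1] 0.1196033
Based on the above calculations, P(A, B) != P(A)P(B), so the two events are not independent.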
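As a cross-check, the joint, marginal and conditional probabilities can all be derived from the four cell counts, using the definition P(A|B) = P(A and B) / P(B). The sketch below copies the counts from the table of counts; the variable names are illustrative and not part of the original analysis.

```r
# Cell counts copied from the table of counts in the text.
n_both_low  <- 108141  # ArrDelay <= 11, DepDelay <= 0
n_dep_high  <- 61026   # ArrDelay <= 11, DepDelay > 0
n_arr_high  <- 6159    # ArrDelay > 11,  DepDelay <= 0
n_both_high <- 48548   # ArrDelay > 11,  DepDelay > 0
n_total <- n_both_low + n_dep_high + n_arr_high + n_both_high

p_joint <- n_both_high / n_total                 # P(A and B)
p_a     <- (n_arr_high + n_both_high) / n_total  # P(A) = P(X > 11)
p_b     <- (n_dep_high + n_both_high) / n_total  # P(B) = P(Y > 0)
p_a_given_b <- p_joint / p_b                     # P(A | B)

# Under independence the two values in each pair would agree.
c(joint = p_joint, product = p_a * p_b)
c(conditional = p_a_given_b, marginal = p_a)
```

Either comparison can be used: independence requires both P(A and B) = P(A)P(B) and P(A|B) = P(A).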
t1=c(x,y)
t2=table(t1)
chisq.test(t2)
##
## Chi-squared test for given probabilities
##
## data: t2
## X-squared = 5952700, df = 519, p-value < 2.2e-16
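The test above is a goodness-of-fit test on the pooled frequencies of the delay values (its null hypothesis is that every value is equally likely). Although p<0.05, it therefore does not directly test whether Arrival Delay and Departure Delay are independent. The hypothesis of independence is properly tested on the 2x2 table of counts built earlier.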
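A chi-squared test of independence is normally run on the contingency table of the two variables rather than on a single vector of pooled values. A minimal sketch on the 2x2 cell counts reported in the table of counts (the object names `observed` and `indep_test` are illustrative):

```r
# Observed 2x2 counts (rows: ArrDelay, columns: DepDelay), filled column by column.
observed <- matrix(
  c(108141, 6159, 61026, 48548),
  nrow = 2,
  dimnames = list(ArrDelay = c("<=Q3", ">Q3"), DepDelay = c("<=Q2", ">Q2"))
)

# Pearson chi-squared test of independence on the contingency table.
# For a 2x2 table, chisq.test() applies Yates' continuity correction by default.
indep_test <- chisq.test(observed)
indep_test$expected   # counts expected if the two delays were independent
indep_test$p.value
```

Comparing `observed` with `indep_test$expected` shows which cells drive the departure from independence.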
Below is a scatter plot of arrival delay against departure delay
library(plotly)
plot_ly(data = h1, x = ~ArrDelay, y = ~DepDelay, type = "scatter", mode = "markers")
The scatter plot suggests a strong positive correlation between the two delay times.
Next we calculate the 95% confidence interval for the difference of the means (mean of x minus mean of y), using a Welch two-sample t-test
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -26.106, df = 445800, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.494880 -2.146418
## sample estimates:
## mean of x mean of y
## 7.094334 9.414983
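Because each flight contributes both an arrival delay and a departure delay, the two samples are paired, and a paired test on the per-flight differences is a common alternative to the Welch test. This is a minimal sketch on simulated data; the vectors `dep` and `arr` are invented and are not the hflights values.

```r
set.seed(1)
dep <- rnorm(200, mean = 9, sd = 25)            # simulated departure delays
arr <- dep - 2 + rnorm(200, mean = 0, sd = 8)   # arrival = departure - 2 + noise

# Paired t-test: equivalent to a one-sample t-test on arr - dep.
paired <- t.test(arr, dep, paired = TRUE, conf.level = 0.95)
paired$conf.int
```

When the two variables are strongly positively correlated, the paired interval is typically narrower than the Welch interval, since the shared variation cancels in the differences.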
Derive a correlation matrix for two of the quantitative variables
h=subset(hflights,select=c(ArrDelay,DepDelay))
h1=na.omit(h)
cor(h1$ArrDelay, h1$DepDelay)
## [1] 0.9292181
corm = matrix(c(1,0.929,0.929,1),nrow=2,ncol=2)
corm
## [,1] [,2]
## [1,] 1.000 0.929
## [2,] 0.929 1.000
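The matrix above was typed in from the rounded coefficient. As a sketch, `cor()` applied to a whole data frame returns the correlation matrix directly and at full precision; `toy` below is an invented data frame standing in for `h1`.

```r
# Invented values, for illustration only.
toy <- data.frame(
  ArrDelay = c(-5, 20, 3, 40, 11, 12),
  DepDelay = c(-2, 15, 1, 35,  0, 30)
)

# Pearson correlation by default; the result is symmetric with 1s on the diagonal.
corm_toy <- cor(toy)
corm_toy
```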
Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval
cor.test(h1$ArrDelay, h1$DepDelay, conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: h1$ArrDelay and h1$DepDelay
## t = 1189.8, df = 223870, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.9284710 0.9299578
## sample estimates:
## cor
## 0.9292181
Based on the above (p < 2.2e-16, and a 99% confidence interval that does not contain 0), we reject the null hypothesis that the correlation between the variables is 0.
precm = solve(corm)
precm
## [,1] [,2]
## [1,] 7.301455 -6.783052
## [2,] -6.783052 7.301455
precm%*%corm
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
corm%*%precm
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
We multiply the precision matrix by the correlation matrix, and the correlation matrix by the precision matrix. Both products give the 2x2 identity matrix. This is expected: matrix multiplication generally gives different answers depending on the order of the factors, but here the two matrices are inverses of each other, so the product is the identity in either order.
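For a 2x2 correlation matrix with off-diagonal entry r, the inverse has the closed form (1 / (1 - r^2)) times the matrix with 1 on the diagonal and -r off the diagonal. The sketch below checks this against `solve()` for r = 0.929; `precm_closed` is an illustrative name.

```r
r <- 0.929
corm <- matrix(c(1, r, r, 1), nrow = 2)

# Closed-form inverse of a 2x2 correlation matrix.
precm_closed <- matrix(c(1, -r, -r, 1), nrow = 2) / (1 - r^2)

all.equal(precm_closed, solve(corm))   # agrees with the numerical inverse
round(corm %*% precm_closed, 10)       # identity up to floating-point error
```

The factor 1 / (1 - r^2) explains why the entries of the precision matrix grow large as r approaches 1.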
summary(h1$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -70.000 -8.000 0.000 7.094 11.000 978.000
summary(h1$DepDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -33.000 -3.000 0.000 9.415 9.000 981.000
The minimum values are -70 and -33. We shift the arrival delays by 71 so that the minimum value is above 0.
x=h1$ArrDelay+71
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 63.00 71.00 78.09 82.00 1049.00
Histogram of Shifted Data
plot_ly(x=x, type="histogram")
Histogram of Original Data
plot_ly(x=hflights$ArrDelay, type="histogram")
Next we load the MASS package and fit an exponential distribution to the shifted data
require(MASS)
l=fitdistr(x, "exponential")
l
## rate
## 1.280503e-02
## (2.706317e-05)
l$estimate
## rate
## 0.01280503
samples=rexp(1000, l$estimate )
Below are the 95th and 5th percentiles of the fitted exponential distribution, estimated from the 1000 simulated samples (the exact values would come from the inverse CDF, qexp)
quantile(samples, probs=0.95)
## 95%
## 257.8541
quantile(samples, probs=0.05)
## 5%
## 3.279185
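The percentiles above come from 1000 random draws, so they change from run to run. The exponential quantile function gives them without simulation. A sketch, assuming the fitted rate printed by fitdistr() above:

```r
rate <- 0.01280503   # fitted rate reported by fitdistr() in the text

# Inverse CDF of the exponential distribution: F^{-1}(p) = -log(1 - p) / rate
exact <- qexp(c(0.05, 0.95), rate = rate)
exact
```

Passing `l$estimate` in place of the hard-coded rate would keep the calculation tied to the fitted object.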
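Calculating a 95% interval from the empirical data, assuming normality (using z=1.96, the 0.975 quantile, since the interval is two-tailed). Note that mean ± 1.96*sd is the range expected to contain 95% of individual observations; a confidence interval for the mean would use sd(x)/sqrt(length(x)) in place of sd(x).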
mean(x)-1.96*sd(x)
## [1] 17.90564
mean(x)+1.96*sd(x)
## [1] 138.283
Based on the above calculations, the 95% interval under normality is (17.90564, 138.283) on the shifted scale
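If an interval for the mean itself is wanted, the standard deviation is replaced by the standard error, sd / sqrt(n). This is a minimal sketch with a hypothetical helper `mean_ci`, run on simulated data rather than on the shifted delays.

```r
# Hypothetical helper: normal-approximation confidence interval for the mean.
mean_ci <- function(v, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
  se <- sd(v) / sqrt(length(v))      # standard error of the mean
  c(lower = mean(v) - z * se, upper = mean(v) + z * se)
}

set.seed(42)
v <- rnorm(500, mean = 78, sd = 30)  # simulated stand-in for the shifted delays
mean_ci(v)
```

With a sample as large as the one in the text, the standard error is small, so an interval for the mean is far narrower than the mean ± 1.96*sd range.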
quantile(h1$ArrDelay, probs=0.95)
## 95%
## 57
quantile(h1$ArrDelay, probs=0.05)
## 5%
## -18
Above are the empirical 95th and 5th percentiles of the original (unshifted) arrival delays; on the shifted scale they correspond to 128 and 53.